{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with Multiple DataFrames\n", "
\n", "\n", "Often you'll work with multiple dataframes that you want to stick together or merge. `df.merge()` and `df.concat()` are all you need to know for combining dataframes. The Pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) is very helpful for these functions, but they are pretty easy to grasp.\n", "\n", "```{note}\n", "The example joins shown in this section are inspired by [Chapter 15](https://stat545.com/join-cheatsheet.html) of Jenny Bryan's STAT 545 materials.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sticking DataFrames Together with `pd.concat()`\n", "
\n", "You can use `pd.concat()` to stick dataframes together:\n", "- Vertically: if they have the same **columns**, OR\n", "- Horizontally: if they have the same **rows**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df1 = pd.DataFrame({'A': [1, 3, 5],\n", " 'B': [2, 4, 6]})\n", "df2 = pd.DataFrame({'A': [7, 9, 11],\n", " 'B': [8, 10, 12]})" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
012
134
256
\n", "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 3 4\n", "2 5 6" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
078
1910
21112
\n", "
" ], "text/plain": [ " A B\n", "0 7 8\n", "1 9 10\n", "2 11 12" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
012
134
256
078
1910
21112
\n", "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 3 4\n", "2 5 6\n", "0 7 8\n", "1 9 10\n", "2 11 12" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((df1, df2), axis=0) # axis=0 specifies a vertical stick, i.e., on the columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the indexes were simply joined together? This may or may not be what you want. To reset the index, you can specify the argument `ignore_index=True`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
012
134
256
378
4910
51112
\n", "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 3 4\n", "2 5 6\n", "3 7 8\n", "4 9 10\n", "5 11 12" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((df1, df2), axis=0, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `axis=1` to stick together horizontally:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01278
134910
2561112
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1 2 7 8\n", "1 3 4 9 10\n", "2 5 6 11 12" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((df1, df2), axis=1, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are not limited to just two dataframes, you can concatenate as many as you want:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
012
134
256
378
4910
51112
612
734
856
978
10910
111112
\n", "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 3 4\n", "2 5 6\n", "3 7 8\n", "4 9 10\n", "5 11 12\n", "6 1 2\n", "7 3 4\n", "8 5 6\n", "9 7 8\n", "10 9 10\n", "11 11 12" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((df1, df2, df1, df2), axis=0, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining DataFrames with `pd.merge()`\n", "
\n", "\n", "`pd.merge()` gives you the ability to \"join\" dataframes using different rules (just like with SQL if you're familiar with it). You can use `df.merge()` to join dataframes based on shared `key` columns. Methods include:\n", "- \"inner join\"\n", "- \"outer join\"\n", "- \"left join\"\n", "- \"right join\"\n", "\n", "See this great [cheat sheet](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join) and [these great animations](https://github.com/gadenbuie/tidyexplain) for more insights." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({\"name\": ['Magneto', 'Storm', 'Mystique', 'Batman', 'Joker', 'Catwoman', 'Hellboy'],\n", " 'alignment': ['bad', 'good', 'bad', 'good', 'bad', 'bad', 'good'],\n", " 'gender': ['male', 'female', 'female', 'male', 'male', 'female', 'male'],\n", " 'publisher': ['Marvel', 'Marvel', 'Marvel', 'DC', 'DC', 'DC', 'Dark Horse Comics']})\n", "df2 = pd.DataFrame({'publisher': ['DC', 'Marvel', 'Image'],\n", " 'year_founded': [1934, 1939, 1992]})" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisher
0MagnetobadmaleMarvel
1StormgoodfemaleMarvel
2MystiquebadfemaleMarvel
3BatmangoodmaleDC
4JokerbadmaleDC
5CatwomanbadfemaleDC
6HellboygoodmaleDark Horse Comics
\n", "
" ], "text/plain": [ " name alignment gender publisher\n", "0 Magneto bad male Marvel\n", "1 Storm good female Marvel\n", "2 Mystique bad female Marvel\n", "3 Batman good male DC\n", "4 Joker bad male DC\n", "5 Catwoman bad female DC\n", "6 Hellboy good male Dark Horse Comics" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
publisheryear_founded
0DC1934
1Marvel1939
2Image1992
\n", "
" ], "text/plain": [ " publisher year_founded\n", "0 DC 1934\n", "1 Marvel 1939\n", "2 Image 1992" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An \"inner\" join will return all rows of `df1` where matching values for \"publisher\" are found in `df2`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded
0MagnetobadmaleMarvel1939
1StormgoodfemaleMarvel1939
2MystiquebadfemaleMarvel1939
3BatmangoodmaleDC1934
4JokerbadmaleDC1934
5CatwomanbadfemaleDC1934
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded\n", "0 Magneto bad male Marvel 1939\n", "1 Storm good female Marvel 1939\n", "2 Mystique bad female Marvel 1939\n", "3 Batman good male DC 1934\n", "4 Joker bad male DC 1934\n", "5 Catwoman bad female DC 1934" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"inner\", on=\"publisher\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](inner_join.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An \"outer\" join will return all rows of `df1` and `df2`, placing NaNs where information is unavailable:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded
0MagnetobadmaleMarvel1939.0
1StormgoodfemaleMarvel1939.0
2MystiquebadfemaleMarvel1939.0
3BatmangoodmaleDC1934.0
4JokerbadmaleDC1934.0
5CatwomanbadfemaleDC1934.0
6HellboygoodmaleDark Horse ComicsNaN
7NaNNaNNaNImage1992.0
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded\n", "0 Magneto bad male Marvel 1939.0\n", "1 Storm good female Marvel 1939.0\n", "2 Mystique bad female Marvel 1939.0\n", "3 Batman good male DC 1934.0\n", "4 Joker bad male DC 1934.0\n", "5 Catwoman bad female DC 1934.0\n", "6 Hellboy good male Dark Horse Comics NaN\n", "7 NaN NaN NaN Image 1992.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"outer\", on=\"publisher\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return all rows from `df1` and all columns of `df1` and `df2`, populated where matches occur:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded
0MagnetobadmaleMarvel1939.0
1StormgoodfemaleMarvel1939.0
2MystiquebadfemaleMarvel1939.0
3BatmangoodmaleDC1934.0
4JokerbadmaleDC1934.0
5CatwomanbadfemaleDC1934.0
6HellboygoodmaleDark Horse ComicsNaN
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded\n", "0 Magneto bad male Marvel 1939.0\n", "1 Storm good female Marvel 1939.0\n", "2 Mystique bad female Marvel 1939.0\n", "3 Batman good male DC 1934.0\n", "4 Joker bad male DC 1934.0\n", "5 Catwoman bad female DC 1934.0\n", "6 Hellboy good male Dark Horse Comics NaN" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"left\", on=\"publisher\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](left_join.png)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded
0BatmangoodmaleDC1934
1JokerbadmaleDC1934
2CatwomanbadfemaleDC1934
3MagnetobadmaleMarvel1939
4StormgoodfemaleMarvel1939
5MystiquebadfemaleMarvel1939
6NaNNaNNaNImage1992
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded\n", "0 Batman good male DC 1934\n", "1 Joker bad male DC 1934\n", "2 Catwoman bad female DC 1934\n", "3 Magneto bad male Marvel 1939\n", "4 Storm good female Marvel 1939\n", "5 Mystique bad female Marvel 1939\n", "6 NaN NaN NaN Image 1992" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"right\", on=\"publisher\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many ways to specify the `key` to join dataframes on, you can join on index values, different, column names, etc. Another helpful argument is the `indicator` argument which will add a column to the result telling you where matches were found in the dataframes:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded_merge
0MagnetobadmaleMarvel1939.0both
1StormgoodfemaleMarvel1939.0both
2MystiquebadfemaleMarvel1939.0both
3BatmangoodmaleDC1934.0both
4JokerbadmaleDC1934.0both
5CatwomanbadfemaleDC1934.0both
6HellboygoodmaleDark Horse ComicsNaNleft_only
7NaNNaNNaNImage1992.0right_only
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded _merge\n", "0 Magneto bad male Marvel 1939.0 both\n", "1 Storm good female Marvel 1939.0 both\n", "2 Mystique bad female Marvel 1939.0 both\n", "3 Batman good male DC 1934.0 both\n", "4 Joker bad male DC 1934.0 both\n", "5 Catwoman bad female DC 1934.0 both\n", "6 Hellboy good male Dark Horse Comics NaN left_only\n", "7 NaN NaN NaN Image 1992.0 right_only" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"outer\", on=\"publisher\", indicator=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the way, you can use `pd.concat()` to do a simple \"inner\" or \"outer\" join on multiple datadrames at once. It's less flexible than merge, but can be useful sometimes." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }